Add a Python guide which demonstrates using a LLAMA model in serverless compute #649
Conversation
  - Nitric
  - API
  - AI & Machine Learning
languages:
Rebase and add start_steps. See the Go realtime guide for an example.
Rebased, but start_steps won't work with this repository - it requires users to download the Llama model separately.
Could the model be curled?
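For illustration, a minimal sketch of how the model could be fetched automatically rather than downloaded by hand; the Hugging Face URL and local path below are assumptions, not part of this PR:

# Hypothetical download step (not in this PR): fetch the GGUF model if it
# isn't already present, so readers don't need to download it manually.
import os
import urllib.request

# Assumed source URL and local path; adjust to wherever the model is hosted.
MODEL_URL = (
    "https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF"
    "/resolve/main/Llama-3.2-1B-Instruct-Q4_K_M.gguf"
)
MODEL_PATH = "./models/Llama-3.2-1B-Instruct-Q4_K_M.gguf"

if not os.path.exists(MODEL_PATH):
    os.makedirs(os.path.dirname(MODEL_PATH), exist_ok=True)
    urllib.request.urlretrieve(MODEL_URL, MODEL_PATH)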
Demonstrate how a lightweight Llama model can be used with serverless compute
Force-pushed from 5a7d491 to d21e3e8 (Compare)
Co-authored-by: Ryan Cartwright <[email protected]>
# set 128MB of RAM
# See lambda configuration docs here:
# https://docs.aws.amazon.com/lambda/latest/dg/configuration-function-common.html#configuration-memory-console
Suggested change:
- # set 128MB of RAM
- # See lambda configuration docs here:
- # https://docs.aws.amazon.com/lambda/latest/dg/configuration-function-common.html#configuration-memory-console
+ # set 6GB of RAM
+ # Lambda vCPUs are proportional to memory allocation. And a larger amount of CPUs will improve LLM inference
# set a timeout of 15 seconds
# See lambda timeout values here:
# https://docs.aws.amazon.com/lambda/latest/dg/configuration-function-common.html#configuration-timeout-console
Suggested change:
# set a timeout of 15 seconds
# See lambda timeout values here:
# https://docs.aws.amazon.com/lambda/latest/dg/configuration-function-common.html#configuration-timeout-console
# # set a provisioned concurrency value
# # For info on provisioned concurrency for AWS Lambda see:
# # https://docs.aws.amazon.com/lambda/latest/dg/configuration-concurrency.html
provisioned-concurrency: 0
Suggested change:
# # set a provisioned concurrency value
# # For info on provisioned concurrency for AWS Lambda see:
# # https://docs.aws.amazon.com/lambda/latest/dg/configuration-concurrency.html
provisioned-concurrency: 0
# set the amount of ephemeral-storage: of 512MB
# For info on ephemeral-storage for AWS Lambda see:
# https://docs.aws.amazon.com/lambda/latest/dg/configuration-ephemeral-storage.html
ephemeral-storage: 1024
Update the comment to explain why this value is set.
response = llama_model(
    prompt=prompt,
    max_tokens=150,
    temperature=0.7,
    top_p=0.9,
    stop=["\n"]
)
Not sure if it's worthwhile, but making these options configurable to show off a bit more flexibility could be good.
e.g.
@main.post("/translate")
async def handle_translation(ctx: HttpContext):
    # Could still leave max_tokens hardcoded to make sure prompts don't exceed 30s
    max_tokens = ctx.req.query.get("max_tokens", default_max_tokens)
    temperature = ctx.req.query.get("temperature", default_temperature)
    text = ctx.req.json["text"]

We also support using raw text in the dashboard API testing, so not all prompts need to be wrapped in JSON.
## Conclusion

In this guide, we demonstrated how you can use a lightweight machine learning model like Llama with serverless compute, enabling you to efficiently handle real-time translation tasks without the need for constant infrastructure management.

The combination of serverless architecture and on-demand model execution provides scalability, flexibility, and cost-efficiency, ensuring that resources are only consumed when necessary. This setup allows you to run lightweight models in a cloud-native way, ideal for dynamic applications requiring minimal operational overhead.
It might be really cool to follow on from this guide with a websocket chatbot.
llama_model = Llama(model_path="./models/Llama-3.2-1B-Instruct-Q4_K_M.gguf")

# Function to perform translation using the Llama model
def translate_text(text):
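For context, here is a minimal sketch of how translate_text might wrap the llama_model call shown earlier in the diff; the prompt template, target language, and response handling are assumptions rather than the guide's actual implementation:

# Hedged sketch only: one plausible shape for the guide's translate_text helper,
# reusing the llama_model parameters shown in the diff above.
def translate_text(text):
    # Assumed prompt template; the real guide may phrase this differently.
    prompt = f"Translate the following text to French:\n{text}\nTranslation:"
    response = llama_model(
        prompt=prompt,
        max_tokens=150,
        temperature=0.7,
        top_p=0.9,
        stop=["\n"]
    )
    # llama-cpp-python returns a completion dict; take the first choice's text.
    return response["choices"][0]["text"].strip()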
I think translating text is an interesting use case, but would it also be simpler to pass the prompt through directly from the user's request and allow them to test any prompt (e.g. "What is the capital of France?"), especially if the goal is just to demonstrate running these models in serverless compute?
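If that passthrough approach were adopted, it might look roughly like the sketch below; the route name and response handling are assumptions, and the request fields mirror the ones used in the reviewer's earlier example:

# Hedged sketch: accept any prompt from the request and run it directly.
# Assumes `main` and `llama_model` are defined earlier in the guide.
@main.post("/prompt")
async def handle_prompt(ctx: HttpContext):
    prompt = ctx.req.json["prompt"]

    response = llama_model(
        prompt=prompt,
        max_tokens=150,  # kept hardcoded so requests stay within the Lambda timeout
        temperature=0.7,
        top_p=0.9,
    )
    ctx.res.body = response["choices"][0]["text"]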
  - python
---

# Using LLama models with serverless infrastructure
A title of "Building AWS LLAMBDAS" just popped into my head; not sure if it's good, as the last part is a bit hard to read :P (I know it applies to other serverless compute as well, but an opportunity for wordplay seems hard to pass up).
# We add more storage to the lambda function, so it can store the model
ephemeral-storage: 1024
Is this true? Isn't the model baked into the container already?
Guide was retargeted; the reviews are stale and will now cause confusion.
In this guide, we demonstrate how you can use a lightweight machine learning model like Llama with serverless compute. This example performs language translation using Llama-3.2-1B-Instruct-Q4_K_M.